Read My Lips: Towards Use of the Microsoft Kinect as a Visual-Only Automatic Speech Recognizer

Authors

  • Peter McKay
  • Bryan Clement
  • Sean Haverty
  • Elijah Newton
  • Kevin Butler
Abstract

Consumer devices used in the home are capable of collecting ever more information from users, including audio and video. The Microsoft Kinect is particularly well-designed for tracking user speech and motion. In this paper, we explore the ability of current models of the Kinect to support use as an automatic speech recognizer (ASR). Lip reading is known to be difficult due to the many possible lip motions. Our goal was to quantify lip movement and observe its correlation with recognized words. Our preliminary results show that word recognition through the audio interface, using the Microsoft Speech API, can provide upwards of 90% accuracy over a corpus of words, and that the visual acuity of the Kinect is such that we can capture a total of 22 data points representing the lip model through the Face Tracking API at high resolution. Based on these results and those of recent work, we forecast that the Kinect has the ability to act as an ASR and that words can potentially be reconstructed through the observation of lip movement without the presence of sound. Such an ability for household devices to observe and parse communication presents a new set of privacy challenges within the home.
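To illustrate the kind of visual-only signal the abstract describes, the sketch below takes per-frame lip landmarks (22 (x, y) points per frame, matching the count the paper reports for the Face Tracking API) and estimates a vertical lip aperture to flag frames where the mouth is open. The landmark indices, threshold, and synthetic data are illustrative assumptions, not the Kinect SDK's actual point layout.

```python
# Hedged sketch: per-frame lip aperture from 22 tracked lip points.
# Indices 3 and 14 are *assumed* upper/lower-lip landmarks; the real
# Kinect Face Tracking point layout may differ.

def lip_aperture(points, top_idx=3, bottom_idx=14):
    """Vertical distance between an assumed upper- and lower-lip landmark."""
    (_, y_top), (_, y_bot) = points[top_idx], points[bottom_idx]
    return abs(y_bot - y_top)

def open_mouth_frames(frames, threshold=5.0):
    """Indices of frames whose aperture exceeds the (assumed) threshold."""
    return [i for i, pts in enumerate(frames) if lip_aperture(pts) > threshold]

# Synthetic example: 3 frames of 22 points each; frame 1 has an open mouth.
closed = [(float(x), 100.0) for x in range(22)]
opened = list(closed)
opened[14] = (14.0, 110.0)  # lower-lip landmark drops by 10 units
frames = [closed, opened, closed]
print(open_mouth_frames(frames))  # -> [1]
```

A real pipeline would map such aperture sequences (and richer shape features) to visemes before attempting word reconstruction; this sketch shows only the first measurement step.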


Similar resources

A binaural microscopic model based on the modulation filter bank for predicting speech intelligibility in normal-hearing listeners

In this study, a binaural microscopic model for the prediction of speech intelligibility based on the modulation filter bank is introduced. So far, spectral criteria such as the STI and SII, or other analytical methods, have been used in binaural models to determine binaural intelligibility. In the proposed model, unlike all previous models of binaural intelligibility prediction, an automatic ...


Bilingual corpus for AVASR using multiple sensors and depth information

In this paper we present the Bilingual Audio-Visual Corpus with Depth information (BAVCD). The database contains utterances of connected digits, spoken by 15 subjects in English and 6 subjects in Greek, and collected employing multiple audio-visual sensors. Among them, of particular interest is the use of the Microsoft Kinect device, which is able to capture facial depth images using the struct...


Continuous-speech phone recognition from ultrasound and optical images of the tongue and lips

The article describes a video-only speech recognition system for a “silent speech interface” application, using ultrasound and optical images of the voice organ. A one-hour audiovisual speech corpus was phonetically labeled using an automatic speech alignment procedure and robust visual feature extraction techniques. HMM-based stochastic models were estimated separately on the visual and acoust...


Lip Tracking Towards an Automatic Lip Reading Approach

The current era aims to make interaction between humans and their artificial partners (computers) easier and more reliable. One topical task is the use of vocal interaction. Speech recognition may be improved by visual information from the human face. In the literature, the lip shape and its movement are referred to as lip reading. Lip-reading computing plays a vital role in a...


Audiovisual Speech Recognition with Articulator Positions as Hidden Variables

Speech recognition, by both humans and machines, benefits from visual observation of the face, especially at low signal-to-noise ratios (SNRs). It has often been noticed, however, that the audible and visible correlates of a phoneme may be asynchronous; perhaps for this reason, automatic speech recognition structures that allow asynchrony between the audible phoneme and the visible viseme outpe...




Journal title:

Publication year: 2013